Linear Regression with pure NumPy code and a comparison with the Sklearn model



Libraries¶

In [1]:
import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.linear_model import LinearRegression
import plotly.express as px
import pandas as pd

Dataset¶

Diabetes dataset¶

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

Number of Instances:

442

Number of Attributes:

First 10 columns are numeric predictive values

Target:

Column 11 is a quantitative measure of disease progression one year after baseline

Attribute Information:
  • age age in years

  • sex

  • bmi body mass index

  • bp average blood pressure

  • s1 tc, total serum cholesterol

  • s2 ldl, low-density lipoproteins

  • s3 hdl, high-density lipoproteins

  • s4 tch, total cholesterol / HDL

  • s5 ltg, possibly log of serum triglycerides level

  • s6 glu, blood sugar level

Note: Each of these 10 feature variables has been mean centered and scaled by the standard deviation times the square root of n_samples (i.e. the sum of squares of each column totals 1).

Source URL: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499. (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

For more information, see the original scikit-learn page below:
https://scikit-learn.org/stable/datasets/toy_dataset.html
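The standardization claim above (mean-centered columns whose squares sum to 1) can be checked directly; a minimal sketch, assuming the default scaled version of the dataset:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)
print(np.allclose(X.mean(axis=0), 0))        # True: each column is mean centered
print(np.allclose((X ** 2).sum(axis=0), 1))  # True: sum of squares per column is 1
```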

In [2]:
X,Y=datasets.load_diabetes(return_X_y=True)

Dataset shape

In [3]:
Y.shape
Out[3]:
(442,)
In [4]:
X.shape
Out[4]:
(442, 10)

Organize data in Pandas DataFrame

In [5]:
inp=['x'+str(i) for i in range(X.shape[1])]
inp+['y']
Out[5]:
['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'y']
In [6]:
df=pd.concat([pd.DataFrame(X,columns=inp),pd.DataFrame(Y,columns=['y'])],axis=1)
df.head(3)
Out[6]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 y
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0

Explanation of each column of dataframe

In [7]:
string='''age age in years
sex
bmi body mass index
bp average blood pressure
s1 tc, total serum cholesterol
s2 ldl, low-density lipoproteins
s3 hdl, high-density lipoproteins
s4 tch, total cholesterol / HDL
s5 ltg, possibly log of serum triglycerides level
s6 glu, blood sugar level
quantitative measure of disease progression one year after baseline'''
In [8]:
print(pd.DataFrame(zip(inp+['y'],string.split('\n')),columns=['column','meaning']).to_markdown(index=False))
| column   | meaning                                                             |
|:---------|:--------------------------------------------------------------------|
| x0       | age age in years                                                    |
| x1       | sex                                                                 |
| x2       | bmi body mass index                                                 |
| x3       | bp average blood pressure                                           |
| x4       | s1 tc, total serum cholesterol                                      |
| x5       | s2 ldl, low-density lipoproteins                                    |
| x6       | s3 hdl, high-density lipoproteins                                   |
| x7       | s4 tch, total cholesterol / HDL                                     |
| x8       | s5 ltg, possibly log of serum triglycerides level                   |
| x9       | s6 glu, blood sugar level                                           |
| y        | quantitative measure of disease progression one year after baseline |

Dataset analysis¶

Correlation of each input column with the output

In [9]:
px.bar(df.corr()['y'][inp])

Generate descriptive statistics of the dataframe

In [10]:
df.describe()
Out[10]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 y
count 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 4.420000e+02 442.000000
mean -2.511817e-19 1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17 3.918434e-17 -5.777179e-18 -9.042540e-18 9.293722e-17 1.130318e-17 152.133484
std 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 4.761905e-02 77.093005
min -1.072256e-01 -4.464164e-02 -9.027530e-02 -1.123988e-01 -1.267807e-01 -1.156131e-01 -1.023071e-01 -7.639450e-02 -1.260971e-01 -1.377672e-01 25.000000
25% -3.729927e-02 -4.464164e-02 -3.422907e-02 -3.665608e-02 -3.424784e-02 -3.035840e-02 -3.511716e-02 -3.949338e-02 -3.324559e-02 -3.317903e-02 87.000000
50% 5.383060e-03 -4.464164e-02 -7.283766e-03 -5.670422e-03 -4.320866e-03 -3.819065e-03 -6.584468e-03 -2.592262e-03 -1.947171e-03 -1.077698e-03 140.500000
75% 3.807591e-02 5.068012e-02 3.124802e-02 3.564379e-02 2.835801e-02 2.984439e-02 2.931150e-02 3.430886e-02 3.243232e-02 2.791705e-02 211.500000
max 1.107267e-01 5.068012e-02 1.705552e-01 1.320436e-01 1.539137e-01 1.987880e-01 1.811791e-01 1.852344e-01 1.335973e-01 1.356118e-01 346.000000
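The constant std of 4.761905e-02 in the table above follows directly from the standardization: with mean 0 and sum of squares 1 per column, the sample standard deviation (ddof=1, as pandas uses) is sqrt(1 / (n - 1)). A quick check:

```python
import numpy as np

# With mean 0 and sum of squares 1 per column, the sample std (ddof=1) is
# sqrt(1 / (n - 1)) for every feature column.
n = 442
print(1 / np.sqrt(n - 1))  # 0.047619... = 1/21, matching the std row above
```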

Box plot of the output

In [11]:
fig = px.box(df, y="y", height=400, width=400,)
fig.show()

Linear regression¶

This section explains a linear regression (LR) implementation using only NumPy, followed by a comparison with the LR of the sklearn library.


[Some formulas in this section are copied from the Wikipedia article on linear regression](https://en.wikipedia.org/wiki/Linear_regression)

Reshape the output¶

In [12]:
y=Y.reshape((len(Y),1))
y.shape
Out[12]:
(442, 1)

Add 1 to X¶


This transformation turns the affine equation y = ax + b into the purely linear form y = XB by prepending a column of ones to X, see below

$ {\displaystyle \mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}^{\mathsf {T}}\\\mathbf {x} _{2}^{\mathsf {T}}\\\vdots \\\mathbf {x} _{n}^{\mathsf {T}}\end{bmatrix}}={\begin{bmatrix}1&x_{11}&\cdots &x_{1p}\\1&x_{21}&\cdots &x_{2p}\\\vdots &\vdots &\ddots &\vdots \\1&x_{n1}&\cdots &x_{np}\end{bmatrix}},} $

In [13]:
UN=(np.zeros(X.shape[0])+1).reshape((X.shape[0],1))
Xu=np.concatenate(( UN,X), axis=1)
In [14]:
X[0:3,0:3]
Out[14]:
array([[ 0.03807591,  0.05068012,  0.06169621],
       [-0.00188202, -0.04464164, -0.05147406],
       [ 0.08529891,  0.05068012,  0.04445121]])
In [15]:
Xu[0:3,0:4]
Out[15]:
array([[ 1.        ,  0.03807591,  0.05068012,  0.06169621],
       [ 1.        , -0.00188202, -0.04464164, -0.05147406],
       [ 1.        ,  0.08529891,  0.05068012,  0.04445121]])
In [16]:
X.shape
Out[16]:
(442, 10)
In [17]:
Xu.shape
Out[17]:
(442, 11)
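An equivalent and slightly more idiomatic way to prepend the intercept column is np.column_stack with np.ones; a small sketch on a toy array:

```python
import numpy as np

X = np.arange(6.0).reshape(3, 2)
Xu = np.column_stack([np.ones(X.shape[0]), X])  # prepend a column of ones
print(Xu.shape)   # (3, 3)
print(Xu[:, 0])   # [1. 1. 1.]
```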

Equation Y = X . B¶

Shape verification¶

By hand


y = X . B
[442,1]=[442,11][11,1]

Using a zero matrix just to verify the shape

In [18]:
# Using a zero matrix just to verify the shape
B=np.zeros((Xu.shape[1],y.shape[1]))
B.shape
Out[18]:
(11, 1)
In [19]:
Xu.dot(B).shape    ,    y.shape
Out[19]:
((442, 1), (442, 1))

The loss function¶

Now putting the independent and dependent variables in matrices *X* and *Y* respectively, the loss function can be rewritten as:



$ {\displaystyle {\begin{aligned}L\left(D,{\vec {\beta }}\right)&=\|X{\vec {\beta }}-Y\|^{2}\\&=\left(X{\vec {\beta }}-Y\right)^{\textsf {T}}\left(X{\vec {\beta }}-Y\right)\\&=Y^{\textsf {T}}Y-Y^{\textsf {T}}X{\vec {\beta }}-{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}Y+{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}X{\vec {\beta }}\end{aligned}}} $

Loss calculation

In [20]:
loss=(Xu.dot(B)-y).T.dot(Xu.dot(B)-y)
loss.shape
Out[20]:
(1, 1)

Remove the extra dimension

In [21]:
np.squeeze(loss)
Out[21]:
array(12850921.)

Gradient of the loss function: dLoss/dB¶

As the loss is convex the optimum solution lies at gradient zero. The gradient of the loss function is (using Denominator layout convention):

$ {\displaystyle {\begin{aligned}{\frac {\partial L\left(D,{\vec {\beta }}\right)}{\partial {\vec {\beta }}}}&={\frac {\partial \left(Y^{\textsf {T}}Y-Y^{\textsf {T}}X{\vec {\beta }}-{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}Y+{\vec {\beta }}^{\textsf {T}}X^{\textsf {T}}X{\vec {\beta }}\right)}{\partial {\vec {\beta }}}}\\&=-2X^{\textsf {T}}Y+2X^{\textsf {T}}X{\vec {\beta }}\end{aligned}}} $

In [22]:
dloss=-2*Xu.T.dot(y)+2*Xu.T.dot(Xu).dot(B)
dloss.shape,B.shape
Out[22]:
((11, 1), (11, 1))

Setting the gradient to zero produces the optimum parameter:

${\displaystyle {\begin{aligned}-2X^{\textsf {T}}Y+2X^{\textsf {T}}X{\vec {\beta }}&=0\\\Rightarrow X^{\textsf {T}}X{\vec {\beta }}&=X^{\textsf {T}}Y\\\Rightarrow {\vec {\hat {\beta }}}&=\left(X^{\textsf {T}}X\right)^{-1}X^{\textsf {T}}Y\end{aligned}}}$

In [23]:
# Optimal B that give a dLoss/dB(Bopt)=0
Bopt=np.linalg.inv(Xu.T.dot(Xu)).dot(Xu.T.dot(y))
Bopt.shape
Out[23]:
(11, 1)

dLoss/dB at Bopt is close to 0

In [24]:
dloss=-2*Xu.T.dot(y)+2*Xu.T.dot(Xu).dot(Bopt)
np.sum(dloss).sum()
Out[24]:
2.2652102416031994e-11
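As a side note, explicitly inverting XᵀX with np.linalg.inv can be numerically fragile for ill-conditioned inputs; np.linalg.solve and np.linalg.lstsq reach the same optimum more robustly. A minimal sketch on random data (the array names here are hypothetical, not from the notebook above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
Xu = np.column_stack([np.ones(len(A)), A])
y = rng.normal(size=(100, 1))

B_inv = np.linalg.inv(Xu.T @ Xu) @ (Xu.T @ y)      # explicit inverse, as above
B_solve = np.linalg.solve(Xu.T @ Xu, Xu.T @ y)     # solves the normal equations directly
B_lstsq, *_ = np.linalg.lstsq(Xu, y, rcond=None)   # least squares without forming X^T X

print(np.allclose(B_inv, B_solve))   # True
print(np.allclose(B_inv, B_lstsq))   # True
```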

Calculate Ŷ with Bopt¶

In [25]:
df['y_hat']=Xu.dot(Bopt)
In [26]:
px.line(df,y=['y','y_hat'])

Make our LR object¶

Definition of the object¶

In [27]:
class linear_regression():
    def __init__(self):
        self.B=0
    def add_one(self,X):
        # This method adds a column of ones to the X input
        UN=(np.zeros(X.shape[0])+1).reshape((X.shape[0],1))
        return np.concatenate(( UN,X), axis=1)
    def fit(self,X,Y):
        # This method computes the optimal B via the normal equations
        Xu=self.add_one(X)
        Bopt=np.linalg.inv(Xu.T.dot(Xu)).dot(Xu.T.dot(Y))
        self.B=Bopt
    def predict(self,X):
        # This method predicts Y from X using the fitted B
        B=self.B
        Xu=self.add_one(X)
        y_hat=Xu.dot(B)
        return y_hat
    def loss(self,X,Y):
        # This method computes the loss with the current B parameters
        B=self.B
        Xu=self.add_one(X)
        loss=(Xu.dot(B)-Y).T.dot(Xu.dot(B)-Y)
        return loss
    def dloss(self,X,Y):
        # This method computes dLoss/dB with the current B parameters
        B=self.B
        Xu=self.add_one(X)
        dloss=-2*Xu.T.dot(Y)+2*Xu.T.dot(Xu).dot(B)
        return dloss

Use the LR object¶

In [28]:
LR=linear_regression()
In [29]:
LR.fit(X,Y)
In [30]:
y_hat2=LR.predict(X)
In [31]:
LR.loss(X,Y)
Out[31]:
1263985.7856333433

dLoss/dB is close to 0 at Bopt

In [32]:
np.abs(LR.dloss(X,Y)).sum()
Out[32]:
4.3513637137948535e-11

Calculate Y_hat

In [33]:
df['y_hat2']=y_hat2
(df.y_hat-df.y_hat2).abs().sum()
Out[33]:
0.0

Comparison of the first y_hat with the second one calculated by the LR object

In [34]:
px.line(df,y=['y_hat','y_hat2'])

Comparison between the local LR object and sklearn.linear_model LR¶

In [35]:
from sklearn.linear_model import LinearRegression
In [36]:
reg = LinearRegression()
reg.fit(X,Y)
Out[36]:
LinearRegression()
In [37]:
y_hat_sklearn=reg.predict(X)
In [38]:
y_hat_sklearn.shape
Out[38]:
(442,)
In [39]:
df['y_hat_sklearn']=y_hat_sklearn

The sklearn.linear_model LR and the local LR produce the same output ŷ

In [40]:
px.line(df,y=['y_hat2','y_hat_sklearn'])

The error between both outputs is close to 0

In [41]:
(df.y_hat2-df.y_hat_sklearn).abs().sum()
Out[41]:
5.17630383001233e-11

The coefficients of both models are the same; the only difference is that our model stores the intercept as the first entry of the B matrix, while sklearn keeps it separately from coef_

In [42]:
reg.coef_[:-1]
Out[42]:
array([ -10.0098663 , -239.81564367,  519.84592005,  324.3846455 ,
       -792.17563855,  476.73902101,  101.04326794,  177.06323767,
        751.27369956])
In [43]:
LR.B[1:]
Out[43]:
array([ -10.0098663 , -239.81564367,  519.84592005,  324.3846455 ,
       -792.17563855,  476.73902101,  101.04326794,  177.06323767,
        751.27369956,   67.62669218])
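The slices above each drop one element; a full comparison can check the intercept against reg.intercept_ and all 10 slope coefficients at once (a sketch reproducing the normal-equation fit inline):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

X, Y = datasets.load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X, Y)

# Normal-equation solution with an explicit intercept column, as in the local LR object
Xu = np.column_stack([np.ones(len(X)), X])
B = np.linalg.inv(Xu.T @ Xu) @ (Xu.T @ Y)

print(np.allclose(B[0], reg.intercept_))  # True: the intercept is B's first entry
print(np.allclose(B[1:], reg.coef_))      # True: all 10 slope coefficients match
```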

Local LR model with multi dimensional output¶

Make a new 2-dimensional output from Y, using the same formula for both columns

In [44]:
Y.shape
Out[44]:
(442,)
In [45]:
Y2=np.stack((Y,0.5*Y),axis=1)
In [46]:
Y2.shape
Out[46]:
(442, 2)

Calculate Ŷ with the sklearn LR model

In [47]:
reg = LinearRegression()
reg.fit(X,Y2)
y_hat_sklearn2=reg.predict(X)
y_hat_sklearn2.shape
Out[47]:
(442, 2)

Calculate the Ŷ with the local LR object

In [48]:
LR=linear_regression()
LR.fit(X,Y2)
y2_hat=LR.predict(X)

The error between both outputs is close to 0

In [49]:
np.abs(y_hat_sklearn2-y2_hat).sum()
Out[49]:
1.4384227142727468e-10

The loss (residual sum of squares matrix) at Bopt

In [50]:
LR.loss(X,Y2)
Out[50]:
array([[1263985.78563334,  631992.89281667],
       [ 631992.89281667,  315996.44640834]])

dLoss/dB is close to 0 at Bopt

In [51]:
np.abs(LR.dloss(X,Y2)).sum()
Out[51]:
7.004530289123068e-11
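Because the second output column is 0.5*Y, the corresponding coefficient column of B should simply be half the first one (and, consistently, the off-diagonal loss entries above are half the top-left one). A quick check of this, refitting via the normal equations:

```python
import numpy as np
from sklearn import datasets

X, Y = datasets.load_diabetes(return_X_y=True)
Y2 = np.stack((Y, 0.5 * Y), axis=1)

Xu = np.column_stack([np.ones(len(X)), X])
B = np.linalg.inv(Xu.T @ Xu) @ (Xu.T @ Y2)  # shape (11, 2): one coefficient column per output

print(B.shape)                              # (11, 2)
print(np.allclose(B[:, 1], 0.5 * B[:, 0]))  # True: halving Y halves every coefficient
```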

Conclusion¶

A local linear regression (LR) object was built using pure NumPy code only, and it behaves the same as the Sklearn LR.


This exercise helps in understanding the linear regression algorithm.